{
 "cells": [
  {
   "cell_type": "markdown",
   "source": [
    "# Terms metadata in Dataverse installations and effect of multiple-license update"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "## Goals of this notebook"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "This notebook explores the \"Terms\" metadata of datasets in 56 known Dataverse installations as of August 2021. This snapshot may be used to learn more about the effects of the \"multiple-license\" update as it makes its way to the Dataverse software (https://github.com/IQSS/dataverse/pull/7920) and is applied to Dataverse installations.\n",
    "\n",
    "The \"multiple-license\" update will change how depositors enter \"Terms\" metadata (that is, metadata about how the data should or must be used). A goal of this update is to encourage the use and machine-readable application of standard licenses to datasets in Dataverse installations. This should make it easier for other people and systems to determine how the data can and can't be used.\n",
    "\n",
    "Additionally, with this update, if a dataset's Terms metadata includes values in any of the fields in the software's \"Terms of Use\" panel, such as Confidentiality Declaration and Special Permissions, the software will consider those terms to be \"Custom Terms,\" even when a CC0 waiver (or a CCBY license in some forked Dataverse installations) was also applied.\n",
    "\n",
    "For more information about the update, see the Multiple License Consensus Proposal ([Google Doc](https://docs.google.com/document/d/10htygglMdlABYWqtcZpqd8sHOwIe6sLL_UJtTv8NEKw)) and the following GitHub issues and pull request:\n",
    "\n",
    "- https://github.com/IQSS/dataverse/issues/7742\n",
    "- https://github.com/IQSS/dataverse/issues/7440\n",
    "- https://github.com/IQSS/dataverse/pull/7920\n",
    "\n",
    "So the two questions this notebook seeks to answer are:\n",
    "- How much of the dataset metadata published by each Dataverse installation includes any Terms metadata? This snapshot could be the first of ongoing efforts to track how Terms metadata in Dataverse installations change over time and particularly after each installation applies this and similar updates, as a way to measure the success of these changes and justify them.\n",
    "- When installations apply this update, which ones have published datasets with CC0 waivers (or, for some forked installations, CCBY licenses) plus any of the eight \"Terms of Use\" fields filled? The software will consider these datasets to have \"Custom Terms\". And how many of these datasets has each installation published? The community might use these numbers to get a sense of the scale of the change from this update, and the numbers might encourage each installation to look into the effects of this change. For example, a closer look at the metadata could reveal how the things that depositors enter into the \"Custom Terms\" fields do or do not conflict with the chosen CC waivers or licenses."
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "## Methods"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "The Terms metadata published by 56 Dataverse installations is recorded in the terms_metadata.tab file. That tabular file was created by:\n",
    "- Downloading all zip files in the dataset at https://doi.org/10.7910/DVN/DCDKZQ. That dataset contains the JSON metadata files collected between August 4 and August 7, 2021 using a Python script. The methods for getting this metadata are described in the dataset's metadata.\n",
    "- Using another Python script to extract the JSON metadata files in each Zip file into a directory.\n",
    "- Parsing the terms metadata of every JSON metadata file in that directory into a single CSV file using the [parse_terms_metadata.py](https://github.com/jggautier/dataverse-scripts/blob/main/get-dataverse-metadata/parse_metadata_fields/parse_terms_metadata.py) script in the GitHub repository at https://github.com/jggautier/dataverse-scripts.\n",
    "- Parsing the value of each JSON files' \"publisher\" key, which indicates the installation that published the metadata in each JSON file, into a single CSV file using the [parse_basic_metadata.py](https://github.com/jggautier/dataverse_scripts/blob/main/get-dataverse-metadata/parse_metadata_fields/parse_basic_metadata.py) script in the GitHub repository at https://github.com/jggautier/dataverse-scripts.\n",
    "- Joining both CSV files into a single CSV file that contains the persistent URLs, dataset version IDs, installation names (\"publishers\"), and Terms metadata for every published dataset version, and converting that CSV file into a .tab file (so that it's easier to make this notebook accessible using the Dataverse software's [binder integration](https://guides.dataverse.org/en/5.7/admin/integrations.html?highlight=integrations#binder).\n",
    "\n",
    "This notebook uses the pandas and numpy Python packages to filter, reshape, and provide simple analysis of the data in the terms_metadata.tab file in order to help answer the two questions."
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "## Exploration"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import Python packages\n",
    "from functools import reduce\n",
    "import numpy as np\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "source": [
    "### Importing and preparing the data"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import data as a dataframe\n",
    "allVersions = pd.read_csv('terms_metadata.tab', sep='\\t', na_filter = False, parse_dates=['datasetPublicationDate', 'versionCreateTime'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "outputs": [
    {
     "data": {
      "text/plain": "   datasetVersionId                        persistentUrl  \\\n0            256773      https://doi.org/10.15454/GHAUFR   \n1             52619      https://doi.org/10.15454/TLXRVW   \n2            170336      https://doi.org/10.15454/CX70I3   \n3              1200  https://doi.org/10.21410/7E4/4WG94W   \n4            198147   https://doi.org/10.7910/DVN/5PRYPC   \n\n             persistent_id license  termsOfUse confidentialityDeclaration  \\\n0      doi:10.15454/GHAUFR    NONE   CC BY 2.0                              \n1      doi:10.15454/TLXRVW    NONE   CC BY 2.0                              \n2      doi:10.15454/CX70I3    NONE   CC BY 2.0                              \n3  doi:10.21410/7E4/4WG94W    NONE                                          \n4   doi:10.7910/DVN/5PRYPC     CC0  CC0 Waiver                              \n\n  specialPermissions restrictions citationRequirements depositorRequirements  \\\n0                                                                              \n1                                                                              \n2                                                                              \n3                                                                              \n4                                                                              \n\n   ... availabilityStatus contactForAccess sizeOfCollection studyCompletion  \\\n0  ...                                                                        \n1  ...                                                                        \n2  ...                                                                        \n3  ...                                                                        \n4  ...                                                                        \n\n  datasetPublicationDate         versionCreateTime versionState  \\\n0             2019-02-15 2019-09-18 22:00:00+00:00     RELEASED   \n1             2019-09-11 2018-09-12 03:19:24+00:00     RELEASED   \n2             2019-01-21 2019-01-21 10:54:41+00:00     RELEASED   \n3             2020-05-05 2020-05-13 16:06:28+00:00     RELEASED   \n4             2020-05-27 2020-06-17 23:49:50+00:00     RELEASED   \n\n  majorVersionNumber minorVersionNumber           publisher  \n0                  3                  0  Portail Data INRAE  \n1                  1                  0  Portail Data INRAE  \n2                  2                  0  Portail Data INRAE  \n3                  2                  1     data.sciencespo  \n4                  4                  0   Harvard Dataverse  \n\n[5 rows x 25 columns]",
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>datasetVersionId</th>\n      <th>persistentUrl</th>\n      <th>persistent_id</th>\n      <th>license</th>\n      <th>termsOfUse</th>\n      <th>confidentialityDeclaration</th>\n      <th>specialPermissions</th>\n      <th>restrictions</th>\n      <th>citationRequirements</th>\n      <th>depositorRequirements</th>\n      <th>...</th>\n      <th>availabilityStatus</th>\n      <th>contactForAccess</th>\n      <th>sizeOfCollection</th>\n      <th>studyCompletion</th>\n      <th>datasetPublicationDate</th>\n      <th>versionCreateTime</th>\n      <th>versionState</th>\n      <th>majorVersionNumber</th>\n      <th>minorVersionNumber</th>\n      <th>publisher</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>256773</td>\n      <td>https://doi.org/10.15454/GHAUFR</td>\n      <td>doi:10.15454/GHAUFR</td>\n      <td>NONE</td>\n      <td>CC BY 2.0</td>\n      <td></td>\n      <td></td>\n      <td></td>\n      <td></td>\n      <td></td>\n      <td>...</td>\n      <td></td>\n      <td></td>\n      <td></td>\n      <td></td>\n      <td>2019-02-15</td>\n      <td>2019-09-18 22:00:00+00:00</td>\n      <td>RELEASED</td>\n      <td>3</td>\n      <td>0</td>\n      <td>Portail Data INRAE</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>52619</td>\n      <td>https://doi.org/10.15454/TLXRVW</td>\n      <td>doi:10.15454/TLXRVW</td>\n      <td>NONE</td>\n      <td>CC BY 2.0</td>\n      <td></td>\n      <td></td>\n      <td></td>\n      <td></td>\n      <td></td>\n      <td>...</td>\n      <td></td>\n      <td></td>\n      <td></td>\n      <td></td>\n      <td>2019-09-11</td>\n      <td>2018-09-12 03:19:24+00:00</td>\n      <td>RELEASED</td>\n      <td>1</td>\n      <td>0</td>\n      <td>Portail Data INRAE</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>170336</td>\n      <td>https://doi.org/10.15454/CX70I3</td>\n      <td>doi:10.15454/CX70I3</td>\n      <td>NONE</td>\n      <td>CC BY 2.0</td>\n      <td></td>\n      <td></td>\n      <td></td>\n      <td></td>\n      <td></td>\n      <td>...</td>\n      <td></td>\n      <td></td>\n      <td></td>\n      <td></td>\n      <td>2019-01-21</td>\n      <td>2019-01-21 10:54:41+00:00</td>\n      <td>RELEASED</td>\n      <td>2</td>\n      <td>0</td>\n      <td>Portail Data INRAE</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>1200</td>\n      <td>https://doi.org/10.21410/7E4/4WG94W</td>\n      <td>doi:10.21410/7E4/4WG94W</td>\n      <td>NONE</td>\n      <td></td>\n      <td></td>\n      <td></td>\n      <td></td>\n      <td></td>\n      <td></td>\n      <td>...</td>\n      <td></td>\n      <td></td>\n      <td></td>\n      <td></td>\n      <td>2020-05-05</td>\n      <td>2020-05-13 16:06:28+00:00</td>\n      <td>RELEASED</td>\n      <td>2</td>\n      <td>1</td>\n      <td>data.sciencespo</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>198147</td>\n      <td>https://doi.org/10.7910/DVN/5PRYPC</td>\n      <td>doi:10.7910/DVN/5PRYPC</td>\n      <td>CC0</td>\n      <td>CC0 Waiver</td>\n      <td></td>\n      <td></td>\n      <td></td>\n      <td></td>\n      <td></td>\n      <td>...</td>\n      <td></td>\n      <td></td>\n      <td></td>\n      <td></td>\n      <td>2020-05-27</td>\n      <td>2020-06-17 23:49:50+00:00</td>\n      <td>RELEASED</td>\n      <td>4</td>\n      <td>0</td>\n      <td>Harvard Dataverse</td>\n    </tr>\n  </tbody>\n</table>\n<p>5 rows × 25 columns</p>\n</div>"
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "allVersions.head(5)"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of dataset versions: 382596\n",
      "Number of unique datasets: 155719\n"
     ]
    }
   ],
   "source": [
    "print('Number of dataset versions: %s' % (len(allVersions)))\n",
    "print('Number of unique datasets: %s' % (len(allVersions.persistentUrl.unique())))"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "**About dataset versioning and Terms metadata**\n",
    "\n",
    "While we know from experience that some datasets published in Dataverse installations contain versions where each version needs different Terms metadata, such as a dataset where each version represents a wave of the same longitudinal research study and data re-users should consider each dataset version's license, let's assume that for a large majority of datasets, the Terms metadata of their latest versions have replaced the metadata of their previous versions.\n",
    "\n",
    "For example, if a dataset has two published versions, the first version has no CC0 waiver, and the second version has a CC0 waiver, the CC0 waiver in the latest version is what should apply to the dataset (and one could argue that previous version should have been deaccessioned).\n",
    "\n",
    "Since dataset versioning in the Dataverse software is designed more as a way to record improvements to a dataset over time, as opposed to a way to publish disparate but connected datasets (such as waves in a longitudinal study), we'll assume that this is how most people use dataset versioning when publishing datasets in Dataverse installations. Additionally, the Dataverse software favors the latest published version of each dataset, tending to expose the metadata in the latest published version more than it exposes the metadata of previously published versions (such as through indexing and search result displays, metadata distributing, and how dataset PIDs direct users to the latest published version).\n",
    "\n",
    "So let's get and consider the Terms metadata for only the latest version of each dataset."
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get only metadata for the latest versions of each dataset\n",
    "latestversion = (allVersions\n",
    "                 .iloc[allVersions\n",
    "                    .groupby('persistentUrl')['datasetVersionId']\n",
    "                    .agg(pd.Series.idxmax)]\n",
    "                 .sort_values(by=['publisher'], inplace=False, ascending=True)\n",
    "                 .reset_index(drop=True, inplace=False))"
   ]
  },
  {
   "cell_type": "markdown",
   "source": [
    "Let's prepare the rest of the data so it's easier to work with."
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "outputs": [],
   "source": [
    "# Replace any blank values with NaN, making it easier to count and sort later\n",
    "latestversion = latestversion.replace(r'^\\s*$', np.nan, regex=True)\n",
    "\n",
    "# Dartmouth's installation exports JSON dataset metadata files with \"Root\" in the \"publisher\" key. Replace \"Root\" with \"Dartmouth\"\n",
    "latestversion['publisher'] = latestversion['publisher'].replace(['Root'],'Dartmouth')"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "There shouldn't be too many unique values entered in the license fields, so let's see what's there."
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "outputs": [
    {
     "data": {
      "text/plain": "array(['NONE', 'CC0', 'CC BY', nan, 'CCBY'], dtype=object)"
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Let's see what's been entered in the license fields\n",
    "latestversion.license.unique()"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "From earlier experiences working with this metadata, I know that both the string \"NONE\" and null values (NaN) have been used to indicate that there's nothing in the license field. Let's replace all 'NONE' strings with NaN."
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "outputs": [
    {
     "data": {
      "text/plain": "   datasetVersionId                        persistentUrl persistent_id  \\\n0                59  https://doi.org/10.25825/FK2/8YKSQV           NaN   \n1               139  https://doi.org/10.25825/FK2/VXVPVP           NaN   \n2               115  https://doi.org/10.25825/FK2/VNAJ1I           NaN   \n3                30  https://doi.org/10.25825/FK2/VEO88P           NaN   \n4                26  https://doi.org/10.25825/FK2/VEANNS           NaN   \n\n  license                                         termsOfUse  \\\n0     NaN  <b>Acceptance of Terms</b><br></br>\\nThe follo...   \n1     NaN  <b>Acceptance of Terms</b><br></br>\\nThe follo...   \n2     NaN  <b>Acceptance of Terms</b><br></br>\\nThe follo...   \n3     NaN  <b>Acceptance of Terms</b><br></br>\\nThe follo...   \n4     NaN  <b>Acceptance of Terms</b><br></br>\\nThe follo...   \n\n  confidentialityDeclaration specialPermissions restrictions  \\\n0                        NaN                NaN          NaN   \n1                        NaN                NaN          NaN   \n2                        NaN                NaN          NaN   \n3                        NaN                NaN          NaN   \n4                        NaN                NaN          NaN   \n\n  citationRequirements depositorRequirements  ... availabilityStatus  \\\n0                  NaN                   NaN  ...                NaN   \n1                  NaN                   NaN  ...                NaN   \n2                  NaN                   NaN  ...                NaN   \n3                  NaN                   NaN  ...                NaN   \n4                  NaN                   NaN  ...                NaN   \n\n  contactForAccess sizeOfCollection studyCompletion datasetPublicationDate  \\\n0              NaN              NaN             NaN             2019-11-26   \n1              NaN              NaN             NaN             2019-10-17   \n2              NaN              NaN             NaN             2019-11-26   \n3              NaN              NaN             NaN             2019-11-26   \n4              NaN              NaN             NaN             2019-11-26   \n\n          versionCreateTime versionState majorVersionNumber  \\\n0 2019-11-14 18:21:51+00:00     RELEASED                  1   \n1 2021-01-06 16:02:23+00:00     RELEASED                  2   \n2 2019-11-18 21:23:56+00:00     RELEASED                  1   \n3 2019-11-13 15:13:57+00:00     RELEASED                  1   \n4 2019-11-12 20:34:20+00:00     RELEASED                  1   \n\n  minorVersionNumber       publisher  \n0                  0  ACSS Dataverse  \n1                  1  ACSS Dataverse  \n2                  0  ACSS Dataverse  \n3                  0  ACSS Dataverse  \n4                  0  ACSS Dataverse  \n\n[5 rows x 25 columns]",
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>datasetVersionId</th>\n      <th>persistentUrl</th>\n      <th>persistent_id</th>\n      <th>license</th>\n      <th>termsOfUse</th>\n      <th>confidentialityDeclaration</th>\n      <th>specialPermissions</th>\n      <th>restrictions</th>\n      <th>citationRequirements</th>\n      <th>depositorRequirements</th>\n      <th>...</th>\n      <th>availabilityStatus</th>\n      <th>contactForAccess</th>\n      <th>sizeOfCollection</th>\n      <th>studyCompletion</th>\n      <th>datasetPublicationDate</th>\n      <th>versionCreateTime</th>\n      <th>versionState</th>\n      <th>majorVersionNumber</th>\n      <th>minorVersionNumber</th>\n      <th>publisher</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>59</td>\n      <td>https://doi.org/10.25825/FK2/8YKSQV</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>&lt;b&gt;Acceptance of Terms&lt;/b&gt;&lt;br&gt;&lt;/br&gt;\\nThe follo...</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>...</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>2019-11-26</td>\n      <td>2019-11-14 18:21:51+00:00</td>\n      <td>RELEASED</td>\n      <td>1</td>\n      <td>0</td>\n      <td>ACSS Dataverse</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>139</td>\n      <td>https://doi.org/10.25825/FK2/VXVPVP</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>&lt;b&gt;Acceptance of Terms&lt;/b&gt;&lt;br&gt;&lt;/br&gt;\\nThe follo...</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>...</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>2019-10-17</td>\n      <td>2021-01-06 16:02:23+00:00</td>\n      <td>RELEASED</td>\n      <td>2</td>\n      <td>1</td>\n      <td>ACSS Dataverse</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>115</td>\n      <td>https://doi.org/10.25825/FK2/VNAJ1I</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>&lt;b&gt;Acceptance of Terms&lt;/b&gt;&lt;br&gt;&lt;/br&gt;\\nThe follo...</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>...</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>2019-11-26</td>\n      <td>2019-11-18 21:23:56+00:00</td>\n      <td>RELEASED</td>\n      <td>1</td>\n      <td>0</td>\n      <td>ACSS Dataverse</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>30</td>\n      <td>https://doi.org/10.25825/FK2/VEO88P</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>&lt;b&gt;Acceptance of Terms&lt;/b&gt;&lt;br&gt;&lt;/br&gt;\\nThe follo...</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>...</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>2019-11-26</td>\n      <td>2019-11-13 15:13:57+00:00</td>\n      <td>RELEASED</td>\n      <td>1</td>\n      <td>0</td>\n      <td>ACSS Dataverse</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>26</td>\n      <td>https://doi.org/10.25825/FK2/VEANNS</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>&lt;b&gt;Acceptance of Terms&lt;/b&gt;&lt;br&gt;&lt;/br&gt;\\nThe follo...</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>...</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>2019-11-26</td>\n      <td>2019-11-12 20:34:20+00:00</td>\n      <td>RELEASED</td>\n      <td>1</td>\n      <td>0</td>\n      <td>ACSS Dataverse</td>\n    </tr>\n  </tbody>\n</table>\n<p>5 rows × 25 columns</p>\n</div>"
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "latestversion = latestversion.replace('NONE', np.nan)\n",
    "latestversion.head(5)"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "Finally, lets remove the datasetVersionId column and reorder the remaining columns so that the columns with basic info about each dataset are together and are followed by columns with the Terms metadata."
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "outputs": [],
   "source": [
    "# Remove versionID column and reorder remaining columns\n",
    "latestversion = latestversion[[\n",
    "    'publisher', 'persistentUrl', 'majorVersionNumber', 'minorVersionNumber', 'versionCreateTime', 'license', 'termsOfUse', 'confidentialityDeclaration', 'specialPermissions',\n",
    "    'restrictions', 'citationRequirements', 'depositorRequirements', 'conditions', 'disclaimer'\n",
    "]]"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of datasets: 155719\n",
      "Number of installations: 56\n"
     ]
    }
   ],
   "source": [
    "print('Number of datasets: %s' % (len(latestversion)))\n",
    "print('Number of installations: %s' % (len(latestversion.publisher.unique())))"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "### Question 1: How many datasets published by each Dataverse installation include any information about how the data can be used?\n",
    "\n",
    "We'll be exploring the Terms metadata of 155,719 datasets in 56 Dataverse installations.\n",
    "\n",
    "First let's get the counts of published datasets in each installation. Then we can compare those counts to the counts of published datasets that have any kind of Terms metadata."
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "outputs": [
    {
     "data": {
      "text/plain": "                           count of datasets\npublisher                                   \nPortail Data INRAE                     72010\nHarvard Dataverse                      44057\nopenforestdata.pl                      10116\nUNC Dataverse                           4537\nScholars Portal Dataverse               4096",
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>count of datasets</th>\n    </tr>\n    <tr>\n      <th>publisher</th>\n      <th></th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>Portail Data INRAE</th>\n      <td>72010</td>\n    </tr>\n    <tr>\n      <th>Harvard Dataverse</th>\n      <td>44057</td>\n    </tr>\n    <tr>\n      <th>openforestdata.pl</th>\n      <td>10116</td>\n    </tr>\n    <tr>\n      <th>UNC Dataverse</th>\n      <td>4537</td>\n    </tr>\n    <tr>\n      <th>Scholars Portal Dataverse</th>\n      <td>4096</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "countDatasetsByInstallation = latestversion.value_counts(subset=['publisher']).to_frame('count of datasets')\n",
    "countDatasetsByInstallation.head(5)"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "How many of those datasets have metadata with any Terms metadata? Let's take a look at a few rows from the latestversion dataframe to see how we might query it."
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "outputs": [
    {
     "data": {
      "text/plain": "        publisher                        persistentUrl  majorVersionNumber  \\\n0  ACSS Dataverse  https://doi.org/10.25825/FK2/8YKSQV                   1   \n1  ACSS Dataverse  https://doi.org/10.25825/FK2/VXVPVP                   2   \n2  ACSS Dataverse  https://doi.org/10.25825/FK2/VNAJ1I                   1   \n3  ACSS Dataverse  https://doi.org/10.25825/FK2/VEO88P                   1   \n4  ACSS Dataverse  https://doi.org/10.25825/FK2/VEANNS                   1   \n\n   minorVersionNumber         versionCreateTime license  \\\n0                   0 2019-11-14 18:21:51+00:00     NaN   \n1                   1 2021-01-06 16:02:23+00:00     NaN   \n2                   0 2019-11-18 21:23:56+00:00     NaN   \n3                   0 2019-11-13 15:13:57+00:00     NaN   \n4                   0 2019-11-12 20:34:20+00:00     NaN   \n\n                                          termsOfUse  \\\n0  <b>Acceptance of Terms</b><br></br>\\nThe follo...   \n1  <b>Acceptance of Terms</b><br></br>\\nThe follo...   \n2  <b>Acceptance of Terms</b><br></br>\\nThe follo...   \n3  <b>Acceptance of Terms</b><br></br>\\nThe follo...   \n4  <b>Acceptance of Terms</b><br></br>\\nThe follo...   \n\n  confidentialityDeclaration specialPermissions restrictions  \\\n0                        NaN                NaN          NaN   \n1                        NaN                NaN          NaN   \n2                        NaN                NaN          NaN   \n3                        NaN                NaN          NaN   \n4                        NaN                NaN          NaN   \n\n  citationRequirements depositorRequirements conditions disclaimer  \n0                  NaN                   NaN        NaN        NaN  \n1                  NaN                   NaN        NaN        NaN  \n2                  NaN                   NaN        NaN        NaN  \n3                  NaN                   NaN        NaN        NaN  \n4                  NaN                   NaN        NaN        NaN  ",
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>publisher</th>\n      <th>persistentUrl</th>\n      <th>majorVersionNumber</th>\n      <th>minorVersionNumber</th>\n      <th>versionCreateTime</th>\n      <th>license</th>\n      <th>termsOfUse</th>\n      <th>confidentialityDeclaration</th>\n      <th>specialPermissions</th>\n      <th>restrictions</th>\n      <th>citationRequirements</th>\n      <th>depositorRequirements</th>\n      <th>conditions</th>\n      <th>disclaimer</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>ACSS Dataverse</td>\n      <td>https://doi.org/10.25825/FK2/8YKSQV</td>\n      <td>1</td>\n      <td>0</td>\n      <td>2019-11-14 18:21:51+00:00</td>\n      <td>NaN</td>\n      <td>&lt;b&gt;Acceptance of Terms&lt;/b&gt;&lt;br&gt;&lt;/br&gt;\\nThe follo...</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>ACSS Dataverse</td>\n      <td>https://doi.org/10.25825/FK2/VXVPVP</td>\n      <td>2</td>\n      <td>1</td>\n      <td>2021-01-06 16:02:23+00:00</td>\n      <td>NaN</td>\n      <td>&lt;b&gt;Acceptance of Terms&lt;/b&gt;&lt;br&gt;&lt;/br&gt;\\nThe follo...</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>ACSS Dataverse</td>\n      <td>https://doi.org/10.25825/FK2/VNAJ1I</td>\n      <td>1</td>\n      <td>0</td>\n      <td>2019-11-18 21:23:56+00:00</td>\n      <td>NaN</td>\n      <td>&lt;b&gt;Acceptance of Terms&lt;/b&gt;&lt;br&gt;&lt;/br&gt;\\nThe follo...</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>ACSS Dataverse</td>\n      <td>https://doi.org/10.25825/FK2/VEO88P</td>\n      <td>1</td>\n      <td>0</td>\n      <td>2019-11-13 15:13:57+00:00</td>\n      <td>NaN</td>\n      <td>&lt;b&gt;Acceptance of Terms&lt;/b&gt;&lt;br&gt;&lt;/br&gt;\\nThe follo...</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>ACSS Dataverse</td>\n      <td>https://doi.org/10.25825/FK2/VEANNS</td>\n      <td>1</td>\n      <td>0</td>\n      <td>2019-11-12 20:34:20+00:00</td>\n      <td>NaN</td>\n      <td>&lt;b&gt;Acceptance of Terms&lt;/b&gt;&lt;br&gt;&lt;/br&gt;\\nThe follo...</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "latestversion.head(5)"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "outputs": [],
   "source": [
    "# Create dataframe containing only datasets with something in their license fields or any Terms of Use fields\n",
    "datasetsWithAnyTerms = (latestversion\n",
    "                    .query('license.notnull() or termsOfUse.notnull() or confidentialityDeclaration.notnull() or specialPermissions.notnull() or restrictions.notnull() or citationRequirements.notnull() or depositorRequirements.notnull() or conditions.notnull() or disclaimer.notnull()')\n",
    "                   .reset_index(drop = True, inplace = False)\n",
    "                   )"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "How many of those datasets are in each installation?"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "outputs": [
    {
     "data": {
      "text/plain": "                    count of datasets with any Terms\npublisher                                           \nPortail Data INRAE                             72005\nHarvard Dataverse                              34447\nopenforestdata.pl                              10042\nUNC Dataverse                                   4284\nRIN Dataverse                                   4016",
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>count of datasets with any Terms</th>\n    </tr>\n    <tr>\n      <th>publisher</th>\n      <th></th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>Portail Data INRAE</th>\n      <td>72005</td>\n    </tr>\n    <tr>\n      <th>Harvard Dataverse</th>\n      <td>34447</td>\n    </tr>\n    <tr>\n      <th>openforestdata.pl</th>\n      <td>10042</td>\n    </tr>\n    <tr>\n      <th>UNC Dataverse</th>\n      <td>4284</td>\n    </tr>\n    <tr>\n      <th>RIN Dataverse</th>\n      <td>4016</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "countDatasetsWithTermsByInstallation = datasetsWithAnyTerms.value_counts(subset=['publisher']).to_frame('count of datasets with any Terms')\n",
    "countDatasetsWithTermsByInstallation.head(5)"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "outputs": [
    {
     "data": {
      "text/plain": "                                      count of datasets  \\\npublisher                                                 \nACSS Dataverse                                      119   \nADA Dataverse                                      1583   \nASU Library Research Data Repository                 26   \nAUSSDA                                             1147   \nAbacus Data Network                                2191   \n\n                                      count of datasets with any Terms  \\\npublisher                                                                \nACSS Dataverse                                                     119   \nADA Dataverse                                                     1583   \nASU Library Research Data Repository                                26   \nAUSSDA                                                            1147   \nAbacus Data Network                                               2190   \n\n                                      percent of datasets with any Terms  \npublisher                                                                 \nACSS Dataverse                                                  1.000000  \nADA Dataverse                                                   1.000000  \nASU Library Research Data Repository                            1.000000  \nAUSSDA                                                          1.000000  \nAbacus Data Network                                             0.999544  ",
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>count of datasets</th>\n      <th>count of datasets with any Terms</th>\n      <th>percent of datasets with any Terms</th>\n    </tr>\n    <tr>\n      <th>publisher</th>\n      <th></th>\n      <th></th>\n      <th></th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>ACSS Dataverse</th>\n      <td>119</td>\n      <td>119</td>\n      <td>1.000000</td>\n    </tr>\n    <tr>\n      <th>ADA Dataverse</th>\n      <td>1583</td>\n      <td>1583</td>\n      <td>1.000000</td>\n    </tr>\n    <tr>\n      <th>ASU Library Research Data Repository</th>\n      <td>26</td>\n      <td>26</td>\n      <td>1.000000</td>\n    </tr>\n    <tr>\n      <th>AUSSDA</th>\n      <td>1147</td>\n      <td>1147</td>\n      <td>1.000000</td>\n    </tr>\n    <tr>\n      <th>Abacus Data Network</th>\n      <td>2191</td>\n      <td>2190</td>\n      <td>0.999544</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Combine the two dataframes so we can compare the count of datasets with any Terms to the count of all datasets\n",
    "dataframes = [\n",
    "    countDatasetsByInstallation,\n",
    "    countDatasetsWithTermsByInstallation,\n",
    "    ]\n",
    "\n",
    "countDatasetsWithAndWithoutTermsByInstallation = reduce(lambda left, right: left.join(right, how='outer'), dataframes)\n",
    "\n",
    "# Format the dataframe to replace NA values with 0 and cast the counts as integers\n",
    "countDatasetsWithAndWithoutTermsByInstallation = (countDatasetsWithAndWithoutTermsByInstallation\n",
    "    .fillna(0)\n",
    "    .astype('int32')\n",
    "    )\n",
    "\n",
    "# Add a column showing the percentage of datasets with any Terms\n",
    "countDatasetsWithAndWithoutTermsByInstallation['percent of datasets with any Terms'] = countDatasetsWithAndWithoutTermsByInstallation['count of datasets with any Terms'] / countDatasetsWithAndWithoutTermsByInstallation['count of datasets']\n",
    "\n",
    "countDatasetsWithAndWithoutTermsByInstallation.head(5)"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "I think it might be helpful for some installations to see which datasets have no Terms metadata, so lets query for that and export the results to a CSV file."
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of datasets with no Terms metadata: 12810\n"
     ]
    }
   ],
   "source": [
    "datasetsWithNoTerms = (latestversion\n",
    "                       .query('(license == \"NONE\" or license.isnull()) and (termsOfUse.isnull() and confidentialityDeclaration.isnull() and specialPermissions.isnull() and restrictions.isnull() and citationRequirements.isnull() and depositorRequirements.isnull() and conditions.isnull() and disclaimer.isnull())')\n",
    "                       .reset_index(drop = True, inplace = False)\n",
    "                       )\n",
    "\n",
    "print('Number of datasets with no Terms metadata: %s' % (len(datasetsWithNoTerms)))\n",
    "\n",
    "# Export dataframe to a CSV file\n",
    "datasetsWithNoTerms.to_csv('datasetsWithNoTerms.csv', index=False)"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "### Question 2: When installations apply the changes in the \"multiple license\" software update, how many datasets will have \"Custom Terms\"? And how many of these datasets has each installation published?\n",
    "\n",
    "Again, the community might use these numbers to get a sense of the scale of the change from this update, and the numbers might encourage each installation to look into the effects of this change. For example, a closer look at the metadata could reveal how the things that depositors enter into the \"Custom Terms\" fields do or do not conflict with the chosen CC waivers or licenses."
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of datasets with something in the license field plus something in one or more Terms of Use fields: 621\n"
     ]
    }
   ],
   "source": [
    "# Create a new dataframe containing the datasets that have a value in the licenses field (that is, their license field is not null) plus one or more values in the Terms of Use fields\n",
    "datasetsWithCustomTerms = (latestversion\n",
    "                            .query('(license.notnull()) and (confidentialityDeclaration.notnull() or specialPermissions.notnull() or restrictions.notnull() or citationRequirements.notnull() or depositorRequirements.notnull() or conditions.notnull() or disclaimer.notnull())')\n",
    "                            .reset_index(drop = True, inplace = False)\n",
    "                            )\n",
    "\n",
    "print('Number of datasets with something in the license field plus something in one or more Terms of Use fields: %s' % (len(datasetsWithCustomTerms)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "outputs": [
    {
     "data": {
      "text/plain": "       publisher                        persistentUrl  majorVersionNumber  \\\n0        CROSSDA      https://doi.org/10.23669/JVNVNR                   1   \n1  DR-NTU (Data)   https://doi.org/10.21979/N9/DHYM9H                   1   \n2  DR-NTU (Data)   https://doi.org/10.21979/N9/OF5ZDK                   1   \n3          DaRUS  https://doi.org/10.18419/darus-1851                   1   \n4      Dataverse  https://doi.org/10.48510/FK2/DUIKUT                   1   \n\n   minorVersionNumber         versionCreateTime license    termsOfUse  \\\n0                   1 2021-07-19 13:23:04+00:00     CC0    CC0 Waiver   \n1                   0 2020-01-13 03:23:43+00:00     CC0    CC0 Waiver   \n2                   0 2020-01-13 03:56:25+00:00     CC0    CC0 Waiver   \n3                   0 2021-05-14 18:02:49+00:00   CC BY  CC BY Waiver   \n4                   0 2021-05-27 13:41:48+00:00     CC0    CC0 Waiver   \n\n  confidentialityDeclaration specialPermissions  \\\n0                        NaN                NaN   \n1                        NaN                NaN   \n2                        NaN                NaN   \n3                        NaN                NaN   \n4                        NaN                NaN   \n\n                                        restrictions  \\\n0                                                NaN   \n1  To download these audio files, researchers mus...   \n2  Researchers must give their name and current r...   \n3                                                NaN   \n4                                                NaN   \n\n                                citationRequirements depositorRequirements  \\\n0  The citation for this study is:\\nVlašiček, D.,...                   NaN   \n1  To cite this dataset:\\nStyles, Suzy J; Bin Mus...                   NaN   \n2  To cite this dataset: \\nStyles, Suzy; Travers ...                   NaN   \n3                         See \"Related Publication\".                   NaN   \n4                                                NaN                   NaN   \n\n                                          conditions  \\\n0                                                NaN   \n1                                                NaN   \n2                                                NaN   \n3  The code and software included under \"software...   \n4                                                NaN   \n\n                                          disclaimer  \n0                                                NaN  \n1                                                NaN  \n2                                                NaN  \n3                                                NaN  \n4   If you use the published dataset or parts of ...  ",
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>publisher</th>\n      <th>persistentUrl</th>\n      <th>majorVersionNumber</th>\n      <th>minorVersionNumber</th>\n      <th>versionCreateTime</th>\n      <th>license</th>\n      <th>termsOfUse</th>\n      <th>confidentialityDeclaration</th>\n      <th>specialPermissions</th>\n      <th>restrictions</th>\n      <th>citationRequirements</th>\n      <th>depositorRequirements</th>\n      <th>conditions</th>\n      <th>disclaimer</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>CROSSDA</td>\n      <td>https://doi.org/10.23669/JVNVNR</td>\n      <td>1</td>\n      <td>1</td>\n      <td>2021-07-19 13:23:04+00:00</td>\n      <td>CC0</td>\n      <td>CC0 Waiver</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>The citation for this study is:\\nVlašiček, D.,...</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>DR-NTU (Data)</td>\n      <td>https://doi.org/10.21979/N9/DHYM9H</td>\n      <td>1</td>\n      <td>0</td>\n      <td>2020-01-13 03:23:43+00:00</td>\n      <td>CC0</td>\n      <td>CC0 Waiver</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>To download these audio files, researchers mus...</td>\n      <td>To cite this dataset:\\nStyles, Suzy J; Bin Mus...</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>DR-NTU (Data)</td>\n      <td>https://doi.org/10.21979/N9/OF5ZDK</td>\n      <td>1</td>\n      <td>0</td>\n      <td>2020-01-13 03:56:25+00:00</td>\n      <td>CC0</td>\n      <td>CC0 Waiver</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>Researchers must give their name and current r...</td>\n      <td>To cite this dataset: \\nStyles, Suzy; Travers ...</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>DaRUS</td>\n      <td>https://doi.org/10.18419/darus-1851</td>\n      <td>1</td>\n      <td>0</td>\n      <td>2021-05-14 18:02:49+00:00</td>\n      <td>CC BY</td>\n      <td>CC BY Waiver</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>See \"Related Publication\".</td>\n      <td>NaN</td>\n      <td>The code and software included under \"software...</td>\n      <td>NaN</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>Dataverse</td>\n      <td>https://doi.org/10.48510/FK2/DUIKUT</td>\n      <td>1</td>\n      <td>0</td>\n      <td>2021-05-27 13:41:48+00:00</td>\n      <td>CC0</td>\n      <td>CC0 Waiver</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>If you use the published dataset or parts of ...</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "datasetsWithCustomTerms.head(5)"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "How many of these datasets are in each Dataverse installation?"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "outputs": [
    {
     "data": {
      "text/plain": "                                               count of datasets with \"Custom Terms\"\npublisher                                                                           \nHarvard Dataverse                                                                394\nUNC Dataverse                                                                    131\nScholars Portal Dataverse                                                         51\nPortail Data INRAE                                                                 6\nPeking University Open Research Data Platform                                      6",
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>count of datasets with \"Custom Terms\"</th>\n    </tr>\n    <tr>\n      <th>publisher</th>\n      <th></th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>Harvard Dataverse</th>\n      <td>394</td>\n    </tr>\n    <tr>\n      <th>UNC Dataverse</th>\n      <td>131</td>\n    </tr>\n    <tr>\n      <th>Scholars Portal Dataverse</th>\n      <td>51</td>\n    </tr>\n    <tr>\n      <th>Portail Data INRAE</th>\n      <td>6</td>\n    </tr>\n    <tr>\n      <th>Peking University Open Research Data Platform</th>\n      <td>6</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "countDatasetsWithCustomTermsByInstallation = datasetsWithCustomTerms.value_counts(subset=['publisher']).to_frame('count of datasets with \"Custom Terms\"')\n",
    "countDatasetsWithCustomTermsByInstallation.head(5)"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "Finally, let's join this dataframe to the dataframe showing all datasets in each installation."
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [],
   "source": [
    "dataframes = [\n",
    "    countDatasetsByInstallation,\n",
    "    countDatasetsWithCustomTermsByInstallation]\n",
    "\n",
    "countDatasetsWithAndWithoutCustomTermsByInstallation = reduce(lambda left, right: left.join(right, how='outer'), dataframes)\n",
    "\n",
    "# Format allCountsByInstallation dataframe to replace NA values with 0, cast values as integers and sort by greatest number of datasets with \"Custom Terms\"\n",
    "countDatasetsWithAndWithoutCustomTermsByInstallation = (countDatasetsWithAndWithoutCustomTermsByInstallation\n",
    "                           .fillna(0)\n",
    "                           .astype('int32')\n",
    "                           .sort_values([\n",
    "                                'count of datasets with \"Custom Terms\"',\n",
    "                                'count of datasets'], ascending=False)\n",
    "                           )\n",
    "\n",
    "# Add a column showing the percentage of datasets with any Terms\n",
    "countDatasetsWithAndWithoutCustomTermsByInstallation['percentage of datasets with \"Custom Terms\"'] = countDatasetsWithAndWithoutCustomTermsByInstallation['count of datasets with \"Custom Terms\"'] / countDatasetsWithAndWithoutCustomTermsByInstallation['count of datasets']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": "                                               count of datasets  \\\npublisher                                                          \nHarvard Dataverse                                          44057   \nUNC Dataverse                                               4537   \nScholars Portal Dataverse                                   4096   \nPortail Data INRAE                                         72010   \nPeking University Open Research Data Platform                320   \n\n                                               count of datasets with \"Custom Terms\"  \\\npublisher                                                                              \nHarvard Dataverse                                                                394   \nUNC Dataverse                                                                    131   \nScholars Portal Dataverse                                                         51   \nPortail Data INRAE                                                                 6   \nPeking University Open Research Data Platform                                      6   \n\n                                               percentage of datasets with \"Custom Terms\"  \npublisher                                                                                  \nHarvard Dataverse                                                                0.008943  \nUNC Dataverse                                                                    0.028874  \nScholars Portal Dataverse                                                        0.012451  \nPortail Data INRAE                                                               0.000083  \nPeking University Open Research Data Platform                                    0.018750  ",
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>count of datasets</th>\n      <th>count of datasets with \"Custom Terms\"</th>\n      <th>percentage of datasets with \"Custom Terms\"</th>\n    </tr>\n    <tr>\n      <th>publisher</th>\n      <th></th>\n      <th></th>\n      <th></th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>Harvard Dataverse</th>\n      <td>44057</td>\n      <td>394</td>\n      <td>0.008943</td>\n    </tr>\n    <tr>\n      <th>UNC Dataverse</th>\n      <td>4537</td>\n      <td>131</td>\n      <td>0.028874</td>\n    </tr>\n    <tr>\n      <th>Scholars Portal Dataverse</th>\n      <td>4096</td>\n      <td>51</td>\n      <td>0.012451</td>\n    </tr>\n    <tr>\n      <th>Portail Data INRAE</th>\n      <td>72010</td>\n      <td>6</td>\n      <td>0.000083</td>\n    </tr>\n    <tr>\n      <th>Peking University Open Research Data Platform</th>\n      <td>320</td>\n      <td>6</td>\n      <td>0.018750</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "countDatasetsWithAndWithoutCustomTermsByInstallation.head(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "source": [
    "Finally, it might also be helpful for some installations to see which datasets will be considered to have \"Custom Terms\", so lets export that dataframe to a CSV file."
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "# Export dataframe to a CSV file\n",
    "datasetsWithCustomTerms.to_csv('datasetsWithCustomTerms.csv', index=False)"
   ]
  }
 ],
 "metadata": {
  "hide_input": false,
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}